INTERSPEECH 2006 - Language and Multimodal

Total: 70

#1 Robust interpretation in dialogue by combining confidence scores with contextual features

Authors: Matthew Purver ; Florin Ratiu ; Lawrence Cavedon

We present an approach to dialogue management and interpretation that evaluates and selects amongst candidate dialogue moves based on features at multiple levels. Multiple interpretation methods can be combined, multiple speech recognition and parsing hypotheses tested, and multiple candidate dialogue moves considered to choose the highest scoring hypothesis overall. We integrate hypotheses generated from shallow slot-filling methods and from relatively deep parsing, using pragmatic information. We show that this gives more robust performance than using either approach alone, allowing n-best list reordering to correct errors in speech recognition or parsing.
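
As a rough illustration of the selection step, each candidate dialogue move can be scored by a weighted combination of features from the recognizer, the parser, and the dialogue context, keeping the highest-scoring candidate. A minimal sketch in Python; the feature names and weights are purely illustrative, not the authors' actual feature set:

```python
# Hypothetical n-best reranking: combine per-hypothesis feature scores
# with fixed weights and keep the argmax.

def rerank(hypotheses, weights):
    """Return the candidate dialogue move with the highest combined score."""
    def score(h):
        return sum(weights[f] * h.get(f, 0.0) for f in weights)
    return max(hypotheses, key=score)

candidates = [
    # feature names are assumptions for illustration
    {"asr_conf": 0.90, "parse_score": 0.40, "context_fit": 0.20},  # top ASR hypothesis
    {"asr_conf": 0.75, "parse_score": 0.80, "context_fit": 0.95},  # better in context
]
weights = {"asr_conf": 1.0, "parse_score": 1.0, "context_fit": 2.0}
print(rerank(candidates, weights))  # contextual features flip the ranking
```

In this toy setting the second hypothesis wins despite its lower recognition confidence, which is exactly the kind of n-best reordering the abstract describes.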

#2 A clustering approach to semantic decoding

Authors: Hui Ye ; Steve Young

This paper presents a novel algorithm for semantic decoding in spoken language understanding systems. Unlike conventional semantic parsers which either use hand-crafted rules or statistical models trained from fully annotated data, the proposed approach uses an unsupervised sentence clustering technique called Y-clustering to automatically select a set of exemplar sentences from a training corpus. These exemplars are combined with simple sentence-level semantic annotations to form templates which are then used for semantic decoding. The performance of this approach was evaluated in the travel domain using the ATIS corpus. Training is fast and cheap, and the results are significantly better than those achieved using HMM-based or stack-based statistical parsers.
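
The abstract does not detail Y-clustering, but the exemplar-selection idea can be illustrated with a generic stand-in: cluster the training sentences and keep the sentence nearest each centroid as a template. The sketch below uses TF-IDF plus k-means purely as an assumed substitute:

```python
# Assumed stand-in for exemplar selection: KMeans over TF-IDF vectors,
# keeping the sentence closest to each cluster centre as a template.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "show me flights from boston to denver",
    "i want a flight from boston to denver",
    "what is the cheapest fare to dallas",
    "list the lowest fare from miami to dallas",
]
X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

exemplars = []
for k in range(km.n_clusters):
    idx = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[k], axis=1)
    exemplars.append(sentences[idx[np.argmin(dists)]])
print(exemplars)  # one template sentence per cluster
```

Each exemplar would then be paired with its sentence-level semantic annotation to form a decoding template.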

#3 A bootstrapping approach for developing language model of new spoken dialogue systems by selecting web texts

Authors: Teruhisa Misu ; Tatsuya Kawahara

This paper proposes a bootstrapping method for constructing statistical language models for new spoken dialogue systems by collecting and selecting sentences from the World Wide Web (WWW). To make effective search queries that cover the target domain in full detail, we exploit a set of documents describing the target domain as seed data. An important issue is how to filter the retrieved Web pages, since not all of the retrieved Web texts are suitable as training data. We introduce an existing dialogue corpus from a different domain to prefer texts in spoken style. The proposed method was evaluated on two different tasks, software support and sightseeing guidance, and a significant reduction in word error rate was achieved. We show that it is vital to incorporate the dialogue corpus, though not relevant to the target domain, in the text selection phase.
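
The filtering step can be pictured as scoring each retrieved sentence under two simple models, one built from the domain seed documents and one from the out-of-domain dialogue corpus, and keeping the sentences that both models like. A toy sketch with add-one-smoothed unigrams and an assumed 50/50 weighting (the paper's actual selection criterion is more elaborate):

```python
# Toy web-text selection: rank sentences by the average of their per-word
# log-probability under a domain model and a spoken-style model.
import math
from collections import Counter

def build(corpus):
    counts = Counter(w for s in corpus for w in s.split())
    return counts, sum(counts.values()), len(counts) + 1

def avg_logprob(sentence, counts, total, vocab):
    words = sentence.split()
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in words) / len(words)

seed_docs = ["install the software update", "the installer reports an error"]
dialogue = ["well could you tell me how to do that", "uh yes please"]
web = ["click here to buy now", "could you tell me how to install the update"]

dom, spk = build(seed_docs), build(dialogue)
ranked = sorted(web, key=lambda s: -(0.5 * avg_logprob(s, *dom) + 0.5 * avg_logprob(s, *spk)))
print(ranked[0])  # the sentence compatible with both domain and spoken style
```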

#4 Phoneme-to-grapheme mapping for spoken inquiries to the semantic web

Authors: Axel Horndasch ; Elmar Nöth ; Anton Batliner ; Volker Warnke

Automatic methods for grapheme-to-phoneme (G2P) and phoneme-to-grapheme (P2G) conversion have become very popular in recent years. Their performance has improved considerably, while at the same time requiring less input from expert lexicographers. Continuing in this tradition, we present in this paper a data-driven, language-independent approach called MASSIVE with which it is possible to create efficient online modules for automatic symbol mapping. Our framework is based solely on statistical methods for training and runtime, and has been optimized for P2G conversion in the context of spoken inquiries to the Semantic Web, an issue researched in the SmartWeb project. MASSIVE systems can be trained using a pronunciation lexicon, the output of a phone recognizer, or any other suitable set of corresponding symbol strings. Successful tests have been performed on German and English data sets.
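
A heavily simplified picture of data-driven P2G: from aligned phoneme/grapheme training pairs, learn the most frequent grapheme chunk per phoneme and apply it greedily. This toy assumes a one-to-one alignment is already given, which MASSIVE's statistical framework does not require:

```python
# Toy phoneme-to-grapheme mapping learned from (assumed) aligned pairs.
from collections import Counter, defaultdict

aligned = [  # (phoneme, grapheme chunk) pairs, pre-aligned for illustration
    ("f", "ph"), ("oU", "o"), ("n", "ne"),
    ("f", "f"), ("oU", "oa"), ("m", "m"),
    ("f", "ph"), ("i", "y"),
]
counts = defaultdict(Counter)
for ph, gr in aligned:
    counts[ph][gr] += 1
p2g = {ph: c.most_common(1)[0][0] for ph, c in counts.items()}

print("".join(p2g[ph] for ph in ["f", "oU", "n"]))  # -> "phone"
```

A real system would score competing chunk sequences statistically rather than always taking the single most frequent mapping.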

#5 Bootstrapping language models for dialogue systems

Authors: Karl Weilhammer ; Matthew N. Stuttle ; Steve Young

We report results on rapidly building language models for dialogue systems. Our baseline is a recogniser using a grammar network. We show that we can almost halve the word error rate (WER) by combining language models generated from a simple task grammar with a standard speech corpus and data collected from the web using a sentence selection algorithm based on relative perplexity. This model compares very well to a language model using "in-domain" data from a Wizard-of-Oz (WOZ) collection. We strongly advocate the use of statistical language models (SLMs) in speech recognisers for dialogue systems and show that costly WOZ data collections are not necessary to build SLMs.
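
One way to picture the combination of language models from several sources is simple linear interpolation of their word probabilities; the sketch below uses unigrams and made-up weights (in practice the weights would be tuned on held-out data, and n-grams rather than unigrams would be used):

```python
# Assumed illustration: linearly interpolate unigram probabilities from
# task-grammar text, a standard corpus, and selected web data.
from collections import Counter

def mle(corpus):
    c = Counter(w for s in corpus for w in s.split())
    n = sum(c.values())
    return lambda w: c[w] / n if n else 0.0

sources = {
    "grammar": mle(["book a flight to london", "book a hotel"]),
    "corpus":  mle(["i would like to book a flight please"]),
    "web":     mle(["cheap flight deals to london and paris"]),
}
weights = {"grammar": 0.3, "corpus": 0.3, "web": 0.4}  # illustrative values

def p(word):
    return sum(weights[s] * sources[s](word) for s in weights)

print(p("flight"), p("london"))
```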

#6 Question answering with discriminative learning algorithms

Author: Junlan Feng

In this paper, we describe a discriminative learning approach for question answering. Our training corpus consists of 2 million Frequently Asked Questions (FAQs) and their corresponding answers, mined from the World Wide Web. This corpus is used to train a lexical and semantic association model between questions and answers. We evaluate our approach on two question answering tasks: the 2003 Text Retrieval Conference Question Answering task, and finding answers to FAQs. In both cases, the proposed approach achieved significant improvements over the results for an information retrieval based question answering model.

#7 Automatic language identification using wavelets

Authors: Ana Lilia Reyes-Herrera ; Luis Villaseñor-Pineda ; Manuel Montes-y-Gómez

Spoken language identification consists in recognizing a language based on a sample of speech from an unknown speaker. The traditional approach to this task mainly considers the phonotactic information of languages. However, for marginalized languages (languages with few speakers, or oral languages without a fixed writing standard), this information is practically unavailable, and consequently the usual approach is not applicable. In this paper, we present a method that considers only the acoustic features of the speech signal and does not use any kind of linguistic information. The method applies a wavelet transform to extract the acoustic features of the speech signal. Experimental results on a pairwise discrimination task among nine languages demonstrate that this approach considerably outperforms previous methods based on the sole use of acoustic features.
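
The feature extraction can be sketched with a discrete wavelet transform: decompose the signal into subbands and use per-subband log energies as the feature vector. The wavelet family, decomposition depth, and energy features below are assumptions; the paper's exact parametrization may differ:

```python
# Sketch of wavelet-based acoustic features using PyWavelets.
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db4", level=5):
    coeffs = pywt.wavedec(signal, wavelet, level=level)      # subband coefficients
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])

rng = np.random.default_rng(0)
fake_speech = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz speech
print(wavelet_features(fake_speech))       # one log-energy value per subband
```

Such vectors, computed per frame or per segment, would then feed a standard classifier for the pairwise language discrimination task.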

#8 Minimum classification error training of hidden Markov models for acoustic language identification

Authors: Josef G. Bauer ; Ekaterina Timoshenko

The goal of acoustic Language Identification (LID) is to identify the language of spoken utterances. The described system is based on parallel Hidden Markov Model (HMM) phoneme recognizers. The standard approach to learning Hidden Markov Model parameters is Maximum Likelihood (ML) estimation, which is not directly related to the classification error rate. Based on the Minimum Classification Error (MCE) parameter estimation scheme, we introduce Minimum Language Identification Error (MLIDE) training, which yields HMM model parameters (mean vectors) that minimize the classification error on the training data. Using a large telephone speech corpus with 7 languages, we achieve a language classification error rate of 4.7%, a 40% reduction in error rate compared with a baseline system using ML-trained HMMs. Even when the system trained on fixed-network telephone speech is applied to mobile-network speech data, MLIDE can greatly improve system performance.
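
The MCE idea underlying MLIDE can be shown numerically: a misclassification measure compares the correct language's score with its best competitor, and a sigmoid turns that margin into a smooth, differentiable loss whose gradient can then update the HMM mean vectors. The scores and smoothing factor below are toy values, and taking the single best competitor is the limit case of the usual soft-max over competitors:

```python
# Numeric sketch of the Minimum Classification Error (MCE) loss.
import numpy as np

def mce_loss(scores, correct, gamma=1.0):
    """scores: per-language log-likelihoods; correct: index of the true language."""
    competitors = np.delete(scores, correct)
    d = -scores[correct] + np.max(competitors)   # > 0 means misclassified
    return 1.0 / (1.0 + np.exp(-gamma * d))      # smoothed 0/1 error

scores = np.array([-120.0, -118.5, -125.0])      # toy log-likelihoods
print(mce_loss(scores, correct=0))               # near 1: a competitor wins
print(mce_loss(np.array([-110.0, -118.5, -125.0]), correct=0))  # near 0
```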

#9 Unsupervised adaptation for acoustic language identification

Authors: Ekaterina Timoshenko ; Josef G. Bauer

Our system for automatic language identification (LID) of spoken utterances performs language-dependent parallel phoneme recognition (PPR) using Hidden Markov Model (HMM) phoneme recognizers and optional phoneme language models (LMs). Such a LID system for continuous speech requires many hours of orthographically transcribed data for training the language-dependent HMMs and LMs, as well as phonetic lexica for every considered language (supervised training). To avoid the time-consuming process of obtaining orthographically transcribed training material, we propose an algorithm for automatic unsupervised adaptation that requires only raw audio data covering the requested language and acoustic environment. The LID system was trained and evaluated using fixed and mobile network databases (DBs) from the SpeechDat II corpus. The baseline system, based on supervised training using fixed network databases and covering 4 languages, achieved a LID error rate of 6.7% for fixed data and 19.5% for mobile data. Using unsupervised adaptation of the HMMs trained on fixed network data, the error rate under the mobile database mismatch is reduced to 10.6%. Exploring the situation where no orthographically transcribed training data is available at all, multilingual HMMs were adapted in an unsupervised manner to the fixed and mobile DBs and perform at 10.8% and 12.4% error rate, respectively.

#10 Low complexity LID using pruned pattern tables of LZW

Authors: S. V. Basavaraja ; T. V. Sreenivas

We present two discriminative language modelling techniques for a Lempel-Ziv-Welch (LZW) based LID system. The previous approach to LID using the LZW algorithm was to use the LZW pattern tables directly for language modelling. But since the patterns in one language's pattern table are shared by other languages' pattern tables, confusability prevailed in the LID task. To overcome this, we present two pruning techniques: (i) Language-Specific (LS-LZW), in which patterns common to more than one pattern table are removed; and (ii) Length-Frequency product based (LF-LZW), in which patterns whose length-frequency product falls below a threshold are removed. These approaches reduce the classification score (compression ratio [LZW-CR] or the weighted discriminant score [LZW-WDS]) for non-native languages and increase LID performance considerably. The memory and computational requirements of these techniques are also much lower than those of the basic LZW techniques.
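
The core mechanics are easy to sketch: build an LZW pattern table per language from phone strings, remove patterns shared across tables (the LS-LZW pruning), and score a test string by its compression ratio against each pruned table. The phone strings below are toy data, and the LF-LZW frequency weighting is omitted:

```python
# Sketch of LZW-based LID with language-specific (LS-LZW) pruning.

def lzw_table(text):
    table, w = set(), ""
    for ch in text:
        if w + ch in table:
            w += ch
        else:
            table.add(w + ch)
            w = ch
    return table

def compression_ratio(text, table):
    codes, w = 0, ""
    for ch in text:
        if w + ch in table:
            w += ch
        else:
            codes += 1
            w = ch
    if w:
        codes += 1
    return codes / max(len(text), 1)   # lower = better modelled by this table

tables = {"L1": lzw_table("abababcabc"), "L2": lzw_table("xyzxyzxxyz")}
shared = tables["L1"] & tables["L2"]
pruned = {lang: t - shared for lang, t in tables.items()}   # LS-LZW pruning
print(min(pruned, key=lambda l: compression_ratio("ababab", pruned[l])))  # -> L1
```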

#11 Improved language identification using support vector machines for language modeling

Authors: Xi Yang ; Lu-Feng Zhai ; Manhung Siu ; Herbert Gish

Automatic language identification (LID) decisions are made based on scores of language models (LMs). In our previous paper [1], we showed that replacing n-gram LMs with SVMs significantly improved the performance of both PPRLM and GMM-tokenization-based LID systems when tested on the OGI-TS corpus. However, the relatively small corpus size may limit the general applicability of the findings. In this paper, we extend the SVM-based approach to the larger CallFriend corpus, evaluated using the NIST 1996 and 2003 evaluation sets. With more data, we find that SVMs still outperform n-gram models. In addition, back-end processing is useful with SVM scores on CallFriend, which differs from our observation on the OGI-TS corpus. By combining the SVM-based GMM and phonotactic systems, our LID system attains an ID error rate of 12.1% on the NIST 2003 evaluation set, which is more than 4% absolute (25% relative) better than the baseline n-gram system.
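
The modeling swap can be illustrated with scikit-learn: represent each tokenized utterance by its n-gram counts and train a linear SVM on those vectors instead of scoring with an n-gram LM. The toy phone strings and labels below stand in for real tokenizer output:

```python
# Sketch: linear SVM over n-gram count vectors of tokenized speech.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = ["aa b k aa t", "b aa t k", "x ch x sh t", "sh x t ch"]
labels = ["en", "en", "de", "de"]                  # toy language labels

clf = make_pipeline(
    CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)),  # uni/bigrams
    LinearSVC(),
)
clf.fit(train, labels)
print(clf.predict(["k aa t b"]))                   # -> ['en']
```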

#13 Fusion of phonotactic and prosodic knowledge for language identification

Authors: Chi-Yueh Lin ; Hsiao-Chuan Wang

Over the last few decades, language identification systems based on different kinds of linguistic knowledge have been studied by many researchers. Most systems utilize only one kind of linguistic knowledge, i.e. phonotactics, phonetic repertoire, or prosody. It is possible to obtain improvements by combining several kinds of linguistic knowledge. However, combining two systems based on different kinds of linguistic knowledge is not a trivial task. This paper presents a method in which the local identification results made by two individual systems, i.e. prosody-based and phonotactic-based systems, are fused in a Bayesian framework. Under this framework, the local decisions and the associated false-alarm and miss probabilities are fused via the Bayesian formulation to make the final decision. Experiments conducted on the OGI-TS corpus demonstrate the effectiveness of this decision-level fusion strategy.
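
At decision level, each subsystem's hard accept/reject output can be converted into a likelihood ratio using its known miss and false-alarm probabilities, and the ratios multiplied into the prior odds. The error rates and the equal-prior assumption below are illustrative, and subsystem independence is assumed:

```python
# Numeric sketch of Bayesian decision-level fusion for "utterance is language L".

def decision_likelihood_ratio(decision, p_miss, p_fa):
    # P(decision | L) / P(decision | not L)
    if decision:                       # subsystem said "L"
        return (1 - p_miss) / p_fa
    return p_miss / (1 - p_fa)

prior_odds = 1.0                       # equal priors assumed
subsystems = [                         # (decision, p_miss, p_fa)
    (True, 0.20, 0.10),                # e.g. prosody-based system
    (False, 0.05, 0.15),               # e.g. phonotactic-based system
]
odds = prior_odds
for d, pm, pf in subsystems:
    odds *= decision_likelihood_ratio(d, pm, pf)
print("accept L" if odds > 1.0 else "reject L", round(odds, 3))
```

Here the phonotactic system's confident rejection (it has a low miss rate, so its "reject" is strong evidence) outweighs the prosodic system's acceptance.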

#14 Vector-based spoken language recognition using output coding

Authors: Haizhou Li ; Bin Ma ; Rong Tong

The vector-based spoken language recognition approach converts a spoken utterance into a high-dimensional vector, also known as a bag-of-sounds vector, that consists of n-gram statistics of acoustic units. Dimensionality reduction better prepares the bag-of-sounds vectors for classifier design. We propose projecting the bag-of-sounds vectors onto a low-dimensional SVM output coding space, where each dimension represents a decision hyperplane between a pair of spoken languages. We also compare the performance of the output coding approach with the traditional low-rank approximation approach using latent semantic indexing (LSI) on the NIST 1996, 2003 and 2005 Language Recognition Evaluation (LRE) databases. The experiments show that the output coding approach consistently outperforms LSI with competitive results.
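
The projection itself is straightforward to sketch: train one linear SVM per language pair on bag-of-sounds vectors, then represent an utterance by its signed distances to the pairwise decision hyperplanes. Random vectors stand in for real bag-of-sounds statistics here:

```python
# Sketch of SVM output coding: high-dimensional vectors -> one dimension
# per language pair.
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
langs = ["en", "zh", "es"]
X = {l: rng.normal(loc=i, size=(20, 1000)) for i, l in enumerate(langs)}  # toy data

pair_svms = []
for a, b in combinations(langs, 2):            # one hyperplane per language pair
    Xab = np.vstack([X[a], X[b]])
    yab = [0] * len(X[a]) + [1] * len(X[b])
    pair_svms.append(LinearSVC().fit(Xab, yab))

def project(v):
    """Map a 1000-dim bag-of-sounds vector to a 3-dim output-coding vector."""
    return np.array([svm.decision_function(v[None, :])[0] for svm in pair_svms])

print(project(rng.normal(loc=0, size=1000)))   # low-dimensional representation
```

A final classifier for the target languages would then be trained in this low-dimensional space.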

#15 Basque-Spanish language identification using phone-based methods

Authors: Victor G. Guijarrubia ; M. Ines Torres

This paper presents initial experiments in language identification for Spanish and Basque, both official languages in the Basque Country in the north of Spain. We focus on three methods based on Hidden Markov Models (HMMs): parallel phone decoding with no phonotactic knowledge, parallel phone decoders scored by phonotactic models, and a single phone decoder scored by phonotactic models. Results for the three techniques are presented, along with others obtained using a neural network classifier. Accuracy improves significantly when richer phonotactic knowledge is used. The use of a neural network classifier results in a slight further improvement and, overall, similar results are achieved for both languages, with accuracies around 98%.

#16 The role of prosody in the perception of US native English accents

Authors: Ayako Ikeno ; John H. L. Hansen

The speech signal contains a wide range of cues to a particular speaker's characteristics. Accent is a linguistic trait of speaker identity that indicates the speaker's language and social background. The goal of this study is to provide a perceptual assessment of accent variation in US native English. The main issue considered is how different components of prosody affect accent perception. The perceptual study employed an ASHA-certified acoustic sound booth and 73 listeners (53 male, 20 female). The results from these perceptual experiments indicate the importance of prosody in combination with the availability of utterance content, via the speech signal or transcripts. The trends also indicate that listeners' decisions are influenced by conceptual representations of prototypical accent characteristics, such as "people from New York talk fast." These observations suggest that listeners use both bottom-up processing, based on the acoustic input, and top-down processing, based on their conceptual representations of prototypical accent characteristics. These processes are multi-dimensional in that listeners use utterance content (e.g., meaning or comprehensibility) as well as accent characteristics in the acoustic input, even though our experiment focuses on pronunciation features and does not include word selections that are dialect-dependent. These findings contribute to a deeper understanding of the cognitive aspects of accent variation and its applications for speech technology, such as accent classification for speaker identification or speech recognition.

#17 Perceptual identification and phonetic analysis of 6 foreign accents in French

Authors: Bianca Vieru-Dimulescu ; Philippe Boula de Mareüil

A perceptual experiment was designed to determine to what extent naive French listeners are able to identify foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish. They succeed in recognizing the speaker's mother tongue in more than 50% of cases (while rating the degree of accentedness as average). They perform best with Arabic speakers and worst with Portuguese speakers. The Spanish/Italian and English/German accents are the most often confused. Phonetic analyses were conducted; clustering and scaling techniques were applied to the results and related to the listeners' reactions recorded during the test. They support the idea that differences in vowel realization (especially concerning the phoneme /y/) seem to outweigh rhythmic cues.

#18 Gaussian mixture selection and data selection for unsupervised Spanish dialect classification

Authors: Rongqing Huang ; John H. L. Hansen

Automatic dialect classification has gained interest in the field of speech research because it is important for characterizing speaker traits and estimating knowledge that could improve integrated speech technology (e.g., speech recognition, speaker recognition). This study addresses novel advances in unsupervised classification of spontaneous Latin American Spanish dialects. The problem considers the case where no transcripts are available for training and test data, and speakers are talking spontaneously. A technique is proposed that finds the dialect dependence in the untranscribed audio by selecting the most discriminative Gaussian mixtures and the most discriminative frames of speech. The Gaussian Mixture Model (GMM) based classifier is retrained after the dialect-dependence information is identified. Both the MS-GMM (GMM trained with mixture selection) and FS-GMM (GMM trained with frame selection) classifiers improve dialect classification performance significantly. Using 122 speakers across three dialects of Spanish with 3.3 hours of speech, the relative error reductions are 30.4% and 26.1%, respectively.
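
The frame-selection half of the idea can be sketched directly: score every frame under each dialect's GMM and keep only frames with a large between-dialect log-likelihood spread, i.e. the frames that actually discriminate. The 2-D toy features and the 70th-percentile threshold are assumptions:

```python
# Sketch of discriminative frame selection with per-dialect GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(rng.normal(0, 1, (200, 2)))
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(rng.normal(1, 1, (200, 2)))

frames = rng.normal(0.5, 1.5, (100, 2))        # toy utterance frames
spread = np.abs(gmm_a.score_samples(frames) - gmm_b.score_samples(frames))
selected = frames[spread > np.percentile(spread, 70)]   # keep the top 30%
print(len(selected), "of", len(frames), "frames kept for retraining")
```

The retained frames (and, analogously, the most discriminative mixture components) would then be used to retrain the classifier.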

#19 Building an English-Iraqi Arabic machine translation system for spoken utterances with limited resources

Authors: Jason Riesa ; Behrang Mohit ; Kevin Knight ; Daniel Marcu

This paper presents an English-Iraqi Arabic speech-to-speech statistical machine translation system built with limited resources. We explore the constraints involved, describe how we endeavored to mitigate such problems as a non-standard orthography and a highly inflected grammar, and discuss leveraging the plentiful existing resources for Modern Standard Arabic to assist in this task. These combined techniques yield a reduction in unknown words at translation time of over 40% and a +3.65 increase in BLEU score over a previous state-of-the-art system using the same parallel training corpus of spoken utterances.

#20 A phrase-level machine translation approach for disfluency detection using weighted finite state transducers

Authors: Sameer Maskey ; Bowen Zhou ; Yuqing Gao

We propose a novel algorithm to detect disfluency in speech by reformulating the problem as phrase-level statistical machine translation using weighted finite state transducers. We approach the task as translation of noisy speech to clean speech. We simplify our translation framework such that it does not require fertility and alignment models. We tested our model on the Switchboard disfluency-annotated corpus. Using an optimized decoder that is developed for phrase-based translation at IBM, we are able to detect repeats, repairs and filled pauses for more than a thousand sentences in less than a second with encouraging results.
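
Stripped of the WFST machinery, the noisy-to-clean "translation" can be pictured as longest-match phrase replacement with a table whose clean side is often empty. The table entries below are illustrative:

```python
# Toy phrase-level disfluency removal: greedy longest-match lookup in a
# phrase table mapping noisy phrases to clean ones (no fertility/alignment).
phrase_table = {
    ("uh",): (), ("um",): (), ("you", "know"): (),   # filled pauses / fillers
    ("i", "i"): ("i",),                              # repeat
}
MAX_LEN = 3

def clean(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            src = tuple(tokens[i:i + n])
            if src in phrase_table:                  # longest matching phrase
                out.extend(phrase_table[src])
                i += n
                break
        else:                                        # no match: copy the token
            out.append(tokens[i])
            i += 1
    return out

print(clean("i i want uh you know a ticket".split()))
# -> ['i', 'want', 'a', 'ticket']
```

The actual system encodes such phrase pairs and their weights in a transducer and picks the best path with a phrase-based decoder.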

#21 Improving phrase-based Korean-English statistical machine translation

Authors: Jonghoon Lee ; Donghyeon Lee ; Gary Geunbae Lee

In this paper, we describe several techniques for improving Korean-English statistical machine translation. We have built a phrase-based statistical machine translation system in a travel domain. On top of the baseline phrase-based system, several techniques are applied to improve translation quality. Each technique can be applied or removed easily, since each is part of either the preprocessing or the corpus processing. Our experiments show that most of the techniques were successful, with the exception of reordering the word sequence. The combination of the successful techniques significantly improved translation quality.

#22 A hybrid phrase-based/statistical speech translation system

Authors: David Stallard ; Fred Choi ; Kriste Krstovski ; Prem Natarajan ; Rohit Prasad ; Shirin Saleem

Spoken communication across a language barrier is of increasing importance in both civilian and military applications. In this paper, we present a system for task-directed two-way communication between speakers of English and Iraqi colloquial Arabic. The application domain of the system is force protection. The system supports translingual dialogue in areas that include municipal services surveys, detainee screening, and descriptions of people, houses, vehicles, etc. N-gram speech recognition is used to recognize both English and Arabic speech. The system uses a combination of pre-recorded questions and statistical machine translation, with speech synthesis, to translate the recognition output.

#23 High-quality speech translation in the flight domain

Authors: Chao Wang ; Stephanie Seneff

Portability is an important issue for the viability of a domain-specific translation approach. This paper describes an English-to-Chinese translation system for flight-domain queries, utilizing an interlingua translation framework that has been successfully applied in the weather domain. Portability of various components is tested, and new technologies to handle parse ambiguities and ill-formed inputs are developed to enhance the translation framework. Evaluation of translation quality was conducted manually on a set of 432 unseen flight-domain utterances, which were translated into Chinese using a formal method and a new robust back-off method in tandem. We achieved 96.7% sentence accuracy with a rejection rate of 7.6% on manual transcripts, and 89.1% accuracy with an 8.6% rejection rate on speech input. A game for language learning using the translation capability is currently under development.

#24 Optimizing components for handheld two-way speech translation for an English-Iraqi Arabic system

Authors: Roger Hsiao ; Ashish Venugopal ; Thilo Köhler ; Ying Zhang ; Paisarn Charoenpornsawat ; Andreas Zollmann ; Stephan Vogel ; Alan W. Black ; Tanja Schultz ; Alex Waibel

This paper describes our handheld two-way speech translation system for English and Iraqi Arabic. The focus is on developing a field-usable handheld device for speech-to-speech translation. The computation and memory limitations of the handheld impose critical constraints on the ASR, SMT, and TTS components. In this paper we discuss our approaches to optimizing these components for the handheld device and present performance numbers from the evaluations that were an integral part of the project. Since one major aspect of the TransTac program is to build fieldable systems, we spent significant effort on developing an intuitive interface that minimizes the training time for users while also providing useful information, such as back translations, for translation-quality feedback.

#25 Developing an automatic assessment tool for children's oral reading

Authors: Leen Cleuren ; Jacques Duchateau ; Alain Sips ; Pol Ghesquière ; Hugo Van hamme

Automating oral reading assessment and feedback in a reading tutor is a very challenging task. This paper describes our research aimed at developing such automated systems. The first topic is the recording and annotation of CHOREC, a Flemish database of children's oral reading that we are developing in order to characterize oral reading processes statistically. Next, we propose a classification of both oral reading strategies and errors, which provides the basis for the envisaged assessment and feedback. Finally, experimental results show that our two-layered recognition system provides high reading-miscue detection rates, while only a few correctly read words are erroneously tagged as miscues.